Abstract

The automatic extraction of acronyms and their meaning from corpora is an important
sub-task of text mining. It can be seen as a special case of string alignment, where a text
chunk is aligned with an acronym. Alternative alignments have different costs, and ideally the
least costly one should give the correct meaning of the acronym.

We show how this approach can be implemented by means of a 3-tape weighted finite-
state machine (3-WFSM) which reads a text chunk on tape 1 and an acronym on tape 2, and
generates all alternative alignments on tape 3. The 3-WFSM can be automatically generated
from a simple regular expression. No additional algorithms are required at any stage. Our
3-WFSM has a size of 27 states and 64 transitions, and finds the best analysis of an acronym
in a few milliseconds.

1 Introduction

The automatic extraction of acronyms and their meaning from corpora is an important sub-task of 
text mining. We will refer to it by the term acronym-meaning extraction. 

Much work has been done on it. To mention just some: Yeates, Bainbridge, and Witten (2000) 
matched an acronym against a text chunk, thus producing different candidate definitions for the 
acronym. They alternatively tried heuristic approaches, naive Bayes learning, and compression 
(i.e., shortest description length) to select the best candidate. Pustejovsky et al. (2001) added 
shallow parsing, and matched an acronym against a parsed (i.e., structured) text chunk. Schwartz 
and Hearst (2003) used a heuristic algorithm to deterministically find an acronym's meaning. 

The task can be seen as a special case of string alignment between a text chunk and an 
acronym. For example, the text chunk "they have many hidden Markov models" can be aligned 
with the acronym "HMMs" in different ways, such as "they have many hidden Markov models" or 
"they have many hidden Markov models". Alternative alignments have different cost, and ideally 
the least costly one should give the correct meaning of the acronym. String alignment admits in 
general four different edit operations: insertion, deletion, substitution, and preservation of a letter 
(Wagner and Fischer, 1974; Pirkola et al., 2003). In the case of acronym-meaning extraction,
only deletion and preservation can occur. String alignment has a worst-case time complexity of
$O(|s_1||s_2|)$, with $|s_1|$ and $|s_2|$ being the lengths of the two aligned strings, respectively. This also
holds for the present special case.
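
The quadratic bound can be illustrated with a small dynamic program. The following is a minimal sketch, not part of the approach described in this report: it only counts the alignments that use deletion and preservation, and the function name `count_alignments` is an assumption introduced here. Input is assumed to be lower-cased already.

```python
def count_alignments(chunk: str, acronym: str) -> int:
    """Count the alignments of `acronym` with `chunk` that use only
    deletion and preservation, i.e. the ways `acronym` occurs as a
    subsequence of `chunk`. Time is O(|chunk| * |acronym|)."""
    # ways[j] = number of ways to match the first j acronym letters
    # against the part of the chunk read so far.
    ways = [1] + [0] * len(acronym)
    for letter in chunk:
        # Go right to left so that one chunk letter is never used
        # for two acronym positions at once.
        for j in range(len(acronym), 0, -1):
            if acronym[j - 1] == letter:
                ways[j] += ways[j - 1]
    return ways[len(acronym)]


print(count_alignments("they have many hidden markov models", "hmms"))  # prints 7
```

The two nested loops visit every pair of a chunk letter and an acronym letter once, which is where the product of the two lengths comes from.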

The purpose of this report is to show how the alignment-based approach to acronym-meaning 
extraction can be implemented by means of a 3-tape weighted finite-state machine (3-WFSM). 
The 3-WFSM will read a text chunk on tape 1 and an acronym on tape 2, and generate all possible 
alignments on tape 3, inserting dots to mark which letters are used in the acronym. For the above 
example this would give "they have many .hidden .Markov .models", among others. 

The 3-WFSM can be automatically generated from a clear and relatively simple regular ex- 
pression that defines the task in as much detail as we wish. Both the generation and the application 
of the 3-WFSM are done by (more or less) standard operations (Kempe, Champarnaud, and Eisner, 
2004), that are available in finite-state tools such as WFSC (Kempe et al., 2003). No additional 
algorithms are required at any stage, which reduces the development time to a few hours. 

2 Preprocessing the Corpus 

Using basic UNIX utilities, we first extract from a corpus all sentences that contain acronyms in 
brackets, such as 

Between a hidden Markov model (HMM) and a weighted finite-state 
automaton (WFSA) there is a correspondence. 

Then, we split these sentences into pairs consisting of an acronym and the text chunk that precedes 
it (starting from the sentence beginning or from the preceding acronym, respectively). For the 
above example, this is 

Between a hidden Markov model HMM 
and a weighted finite-state automaton WFSA 

Next, we normalize these pairs: capital letters are transformed into small ones, and separators into 
underscores: 

between_a_hidden_markov_model hmm 
and_a_weighted_finite_state_automaton wfsa 
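
The preprocessing above was done with basic UNIX utilities; the following Python sketch is only a stand-in that reproduces the two steps on the example sentence. The function names and the acronym pattern (a bracketed token of letters starting with a capital) are assumptions made for illustration.

```python
import re


def acronym_pairs(sentence: str):
    """Split a sentence into (text chunk, acronym) pairs, one pair per
    bracketed acronym. Each chunk starts at the sentence beginning or
    right after the preceding acronym."""
    pairs = []
    start = 0
    # Assumed pattern: letters in round brackets, first letter capital.
    for match in re.finditer(r"\(([A-Z][A-Za-z]*)\)", sentence):
        chunk = sentence[start:match.start()].strip()
        pairs.append((chunk, match.group(1)))
        start = match.end()
    return pairs


def normalize(text: str) -> str:
    """Lower-case the text and turn each run of separators
    (anything that is not a letter) into a single underscore."""
    return re.sub(r"[^a-z]+", "_", text.lower()).strip("_")


sentence = ("Between a hidden Markov model (HMM) and a weighted finite-state "
            "automaton (WFSA) there is a correspondence.")
for chunk, acronym in acronym_pairs(sentence):
    print(normalize(chunk), normalize(acronym))
```

Text after the last acronym is dropped, since no acronym follows it.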

3 Constructing an Acronym-Meaning Extractor 

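We start by compiling a 4-WFSM over the real tropical semiring $\langle \mathbb{R}_{\geq 0} \cup \{\infty\}, \min, +, \infty, 0 \rangle$,
from the expression

$$A_1^{(4)} = \Big( \langle \langle \varepsilon, \varepsilon, \text{.}, \varepsilon \rangle, 0 \rangle \; \langle \langle [\text{\^{}\_}], [\text{\^{}\_}], [\text{\^{}\_}], \text{a} \rangle_{\{1=2=3\}}, 0 \rangle \;\cup\; \langle \langle [\text{\^{}\_}], \varepsilon, [\text{\^{}\_}], \text{i} \rangle_{\{1=3\}}, 0 \rangle \;\cup\; \langle \langle \text{\_}, \varepsilon, \text{\_}, \text{\_} \rangle, 0 \rangle \Big)^{*} \qquad (1)$$

where [^_] is a symbol class accepting any symbol except underscore, $\varepsilon$ represents the empty
string, {1=2=3} a constraint requiring the different [^_] on tapes 1 to 3 to be instantiated by the
same symbol (Nicart et al., 2006),¹ and the 0's are weights. We use a superscript (n) to indicate
the arity of an n-WFSM (Kempe, Champarnaud, and Eisner, 2004).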

If we apply $A_1^{(4)}$ with tapes 1 and 2 to a normalized text chunk and a corresponding acronym, 
respectively, and generate (neutrally-weighted) alternative analyses from tapes 3 and 4, we obtain, 
for example 



1,2> they_have_many_hidden_markov_models   hmms

3,4> t.hey_have_.many_hidden_.markov_model.s   iaii_iiii_aiii_iiiiii_aiiiii_iiiiia
3,4> t.hey_have_.many_hidden_markov_.model.s   iaii_iiii_aiii_iiiiii_iiiiii_aiiiia
3,4> t.hey_have_many_hidden_.markov_.model.s   iaii_iiii_iiii_iiiiii_aiiiii_aiiiia
3,4> they_.have_.many_hidden_.markov_model.s   iiii_aiii_aiii_iiiiii_aiiiii_iiiiia
3,4> they_.have_.many_hidden_markov_.model.s   iiii_aiii_aiii_iiiiii_iiiiii_aiiiia
3,4> they_.have_many_hidden_.markov_.model.s   iiii_aiii_iiii_iiiiii_aiiiii_aiiiia
3,4> they_have_many_.hidden_.markov_.model.s   iiii_iiii_iiii_aiiiii_aiiiii_aiiiia



On tape 3, letters of the text chunk that are used in the acronym are preceded by a dot. Tape
4 shows the performed operations: a means "acronym letter" and i means "ignored letter". All
analyses have neutral weight 0.
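
To make the contents of tapes 3 and 4 concrete, the following sketch enumerates the same analyses with a plain recursive generator. It is not a finite-state implementation and not the compiled 4-WFSM; the function name `alignments` is an assumption introduced here.

```python
def alignments(chunk: str, acronym: str):
    """Yield (analysis, operations) pairs for every alignment of `acronym`
    with the normalized `chunk`. The analysis marks used letters with a
    preceding dot; the operations string uses 'a' for an acronym letter,
    'i' for an ignored letter, and '_' for an underscore."""
    def walk(i, j):
        # i: position in the chunk, j: position in the acronym
        if i == len(chunk):
            if j == len(acronym):      # the whole acronym was consumed
                yield "", ""
            return
        symbol = chunk[i]
        if symbol == "_":              # underscores are copied unchanged
            for analysis, ops in walk(i + 1, j):
                yield "_" + analysis, "_" + ops
            return
        if j < len(acronym) and acronym[j] == symbol:
            # use this letter as the next acronym letter
            for analysis, ops in walk(i + 1, j + 1):
                yield "." + symbol + analysis, "a" + ops
        # ignore this letter
        for analysis, ops in walk(i + 1, j):
            yield symbol + analysis, "i" + ops
    return walk(0, 0)


for analysis, ops in alignments("they_have_many_hidden_markov_models", "hmms"):
    print(analysis, ops)
```

For the example pair this yields seven analyses, in the order listed above.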

By means of XFST (Karttunen, Gaal, and Kempe, 1998; Beesley and Karttunen, 2003) we
generate from the following regular expression a 2-FSM (i.e., a non-weighted transducer), $A_2'^{(2)}$,
that defines the operations more precisely:

(2)



In this expression, all word-initial i are replaced by u ("unused word") if no letter of this word 
is used in the acronym, but if letters of preceding words are used. Then, all i of a used word 
are replaced by g ("gap letter") if they are followed by an a ("acronym letter") in the same word. 
Next, word-initial g are replaced by G ("word-initial gap letter"). Finally, all a are replaced by a 
number 1 to 8, expressing their position in the word. Positions higher than 8 are marked as 8.

Furthermore, we generate with XFST another 2-FSM, $A_3'^{(2)}$, that deletes all letters of leading
unused words, and the adjacent underscores:



    regex
    [%.|%_|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z]*
    .o. [ \%. -> 0 || .#. \%.* _ \%.* %_ ]
    .o. [ %_ -> 0 || .#. _ ]
    .o. [%.|%_|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z]*        (3)
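
The effect of this transducer on an analysis string can be mimicked in a few lines. This is a sketch of the described behaviour only, not a translation of the XFST expression; the function name `strip_leading_unused` is an assumption, and the input is assumed to contain at least one dotted letter.

```python
def strip_leading_unused(analysis: str) -> str:
    """Delete all leading words that contain no dot (i.e. no letter used
    in the acronym), together with the underscores next to them."""
    words = analysis.split("_")
    first_used = 0
    # Skip words until the first one that contains a dotted letter.
    while first_used < len(words) and "." not in words[first_used]:
        first_used += 1
    return "_".join(words[first_used:])


print(strip_leading_unused("they_have_many_.hidden_.markov_.model.s"))
# prints .hidden_.markov_.model.s
```

Words after the first used word are kept, even if they are unused themselves.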



¹ Roughly following Kempe, Champarnaud, and Eisner (2004), we employ here a simpler notation for constraints
than in Nicart et al. (2006).

These two non-weighted 2-FSMs are transformed into 2-WFSMs, $A_2^{(2)}$ and $A_3^{(2)}$, with neutral
weight, and are joined (Kempe, Champarnaud, and Eisner, 2004) with the previously compiled
4-WFSM $A_1^{(4)}$:

$$A^{(6)} = \left( A_1^{(4)} \bowtie_{\{3=1\}} A_3^{(2)} \right) \bowtie_{\{4=1\}} A_2^{(2)} \qquad (4)$$

In the resulting $A^{(6)}$, we have a modified form of tape 3 on tape 5 (describing analyses), and a 
modified form of tape 4 on tape 6 (describing operations). If we apply tape 1 to a normalized text 
chunk and tape 2 to a corresponding acronym, we obtain from tapes 5 and 6 for example 

1,2> they_have_many_hidden_markov_models   hmms

5,6> t.hey_have_many_hidden_.markov_.model.s   G2ii_uiii_uiii_uiiiii_1iiiii_1gggg6
5,6> t.hey_have_.many_hidden_markov_.model.s   G2ii_uiii_1iii_uiiiii_uiiiii_1gggg6
5,6> t.hey_have_.many_hidden_.markov_model.s   G2ii_uiii_1iii_uiiiii_1iiiii_Ggggg6
5,6> .have_many_hidden_.markov_.model.s   iiii_1iii_uiii_uiiiii_1iiiii_1gggg6
5,6> .have_.many_hidden_markov_.model.s   iiii_1iii_1iii_uiiiii_uiiiii_1gggg6
5,6> .have_.many_hidden_.markov_model.s   iiii_1iii_1iii_uiiiii_1iiiii_Ggggg6
5,6> .hidden_.markov_.model.s   iiii_iiii_iiii_1iiiii_1iiiii_1gggg6



Finally, we assign costs (i.e., weights) to the different operations by means of a 1-WFSM 
generated from the expression 

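$$A_4^{(1)} = \Big( \langle \text{\_}, 0 \rangle \cup \langle \text{i}, 0 \rangle \cup \langle \text{u}, 2 \rangle \cup \langle \text{g}, 1 \rangle \cup \langle \text{G}, 3 \rangle \cup \langle 1, 0 \rangle \cup \langle 2, 1 \rangle \cup \langle 3, 1.5 \rangle \cup \langle 4, 2 \rangle \cup \langle 5, 2.5 \rangle \cup \langle 6, 3 \rangle \cup \langle 7, 3.5 \rangle \cup \langle 8, 4 \rangle \Big)^{*} \qquad (5)$$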

Here we chose the costs by intuition. In an improved approach they could be estimated from data. 
To obtain our acronym-meaning extractor, we join $A_4^{(1)}$ with the previously compiled $A^{(6)}$:

$$\mathit{Acro}^{(6)} = A^{(6)} \bowtie_{\{6=1\}} A_4^{(1)} \qquad (6)$$

If we apply $\mathit{Acro}^{(6)}$ with tapes 1 and 2 to a text chunk and a corresponding acronym, respectively,
we obtain from tapes 5 and 6 for example



1,2> they_have_many_hidden_markov_models   hmms

5,6> t.hey_have_many_hidden_.markov_.model.s   G2ii_uiii_uiii_uiiiii_1iiiii_1gggg6   17
5,6> t.hey_have_.many_hidden_markov_.model.s   G2ii_uiii_1iii_uiiiii_uiiiii_1gggg6   17
5,6> t.hey_have_.many_hidden_.markov_model.s   G2ii_uiii_1iii_uiiiii_1iiiii_Ggggg6   18
5,6> .have_many_hidden_.markov_.model.s   iiii_1iii_uiii_uiiiii_1iiiii_1gggg6   11
5,6> .have_.many_hidden_markov_.model.s   iiii_1iii_1iii_uiiiii_uiiiii_1gggg6   11
5,6> .have_.many_hidden_.markov_model.s   iiii_1iii_1iii_uiiiii_1iiiii_Ggggg6   12
5,6> .hidden_.markov_.model.s   iiii_iiii_iiii_1iiiii_1iiiii_1gggg6   7

We select the analysis with the lowest cost by means of a classical single-source best-path algo-
rithm such as Dijkstra's algorithm (Dijkstra, 1959) or Bellman-Ford's (Bellman, 1958; Ford and
Fulkerson, 1956). If our weights have been optimally chosen, we should now obtain the correct
analysis.
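
The cost assignment and the selection step can be illustrated without any finite-state machinery. In the tropical semiring the weight of a path is the sum of its transition weights, so the cost of an analysis is the sum of the costs of the symbols on tape 6. The sketch below uses the weights of expression (5) and picks the minimum over an explicit candidate list, which replaces the best-path search here; the names `COSTS`, `cost` and `best_analysis` are assumptions.

```python
# Symbol costs as given in expression (5).
COSTS = {"_": 0, "i": 0, "u": 2, "g": 1, "G": 3,
         "1": 0, "2": 1, "3": 1.5, "4": 2, "5": 2.5,
         "6": 3, "7": 3.5, "8": 4}


def cost(operations: str):
    """Sum the symbol costs of an operations string (tape 6)."""
    return sum(COSTS[symbol] for symbol in operations)


def best_analysis(candidates):
    """Return the (analysis, operations) pair with the lowest cost."""
    return min(candidates, key=lambda pair: cost(pair[1]))


candidates = [
    ("t.hey_have_many_hidden_.markov_.model.s",
     "G2ii_uiii_uiii_uiiiii_1iiiii_1gggg6"),
    ("t.hey_have_.many_hidden_.markov_model.s",
     "G2ii_uiii_1iii_uiiiii_1iiiii_Ggggg6"),
    (".hidden_.markov_.model.s",
     "iiii_iiii_iiii_1iiiii_1iiiii_1gggg6"),
]
for analysis, ops in candidates:
    print(analysis, cost(ops))
print(best_analysis(candidates)[0])  # prints .hidden_.markov_.model.s
```

With these weights the three candidates cost 17, 18 and 7, so the analysis that starts at "hidden" is selected.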

In practice, input is read on tapes 1 and 2 and output generated from tapes 2 and 5, as in the
following examples. All other tapes can therefore be removed, leaving us with a 3-WFSM $\mathit{Acro}^{(3)}$.



1,2> they_have_many_hidden_markov_models   hmms
1,2> between_hidden_markov_models   hmms
1,2> and_weighted_finite_state_automata   wfsa
1,2> and_weighted_finite_state_automata   wfa

2,5> hmms   .hidden_.markov_.model.s
2,5> hmms   .hidden_.markov_.model.s
2,5> wfsa   .weighted_.finite_.state_.automata
2,5> wfa   .weighted_.finite_state_.automata

4 Some Results 

We tested the acronym-meaning extractor on many examples. Finding the best analysis for one 
acronym took only a few milliseconds (ms): for example, 3.7 ms for "they have many hidden 
markov models"-"hmms", 9.6 ms for "they have many hidden markov models they have many 
hidden markov models"-"hmms", and 12.0 ms for "they have many hidden markov models they 
have many hidden markov models"-"hmmshmms". 

The extractor had a size of 27 states and 64 transitions. 